Blog
Engineering notes on custom kernels, local inference, and hardware design.
PFlash: 10× prefill speedup over llama.cpp at 128K on an RTX 3090
Long context overwhelms Q4 27B targets on 24 GB GPUs. PFlash compresses 128K tokens to 2.6K with a small drafter before DFlash sees the prompt. Head-to-head, cold-vs-cold: 24.8 s TTFT vs ~257 s for llama.cpp (10.4×); NIAH retrieval preserved at every measured context length.
DFlash on ggml: up to 207 tok/s with Qwen3.5-27B on an RTX 3090
Standalone C++/ggml speculative decoder for Qwen3.5-27B Q4_K_M with a DFlash block-diffusion draft and DDtree verifier. 3.43× over autoregressive decoding, 2.8× over SGLang AWQ, 128K context on 24 GB.
The eGPU Myth: Why a ~$300 Dock Won't Turn Your GPU Into an AI Workstation
tinygrad wrote an NVIDIA driver from scratch. We ran real models on an RTX 3090 over USB4. The engineering is brilliant. The numbers aren't there yet. Full benchmarks and profiling.
Megakernel: Matching Apple Silicon Efficiency at 2× the Throughput on an RTX 3090
The first megakernel for hybrid DeltaNet/Attention LLMs. All 24 layers fused into a single CUDA dispatch. 1.87 tok/J, matching M5 Max efficiency at 1.8× the throughput on a 2020 GPU.